AITopics | textual concept

Table 2: The accuracy on the VQA v2.0 test set. We thank all the reviewers for the helpful comments. Q1: How the paper's contribution relates to the current SOT A? SGAE is a rather complicated scene-graph based method specific to image captioning. The results with current SOT A + MIA will be stated more clearly in the paper. Q2: How to use MIA on the baseline systems (i.e., how is MIA applied to image captioning For the settings, we have listed them in the supplementary materials.

mia, table 1, textual concept, (10 more...)

Neural Information Processing Systems

Genre: Research Report > Promising Solution (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.79)
Information Technology > Artificial Intelligence > Vision (0.78)

Add feedback

Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

Cho, Wonwoong, Zhang, Yanxia, Chen, Yan-Ying, Inouye, David I.

arXiv.org Artificial IntelligenceJul-15-2025

Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited in encoding a real image without loss of details or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of details and blends the visual concept with the object specified by the text in a disentangled way. Our experiment results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on the new application of image generative models to augment human creativity.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2506.24085

Country: Europe (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP

Xu, Binyan, Dai, Xilin, Tang, Di, Zhang, Kehuan

arXiv.org Artificial IntelligenceJul-9-2025

Deep Neural Networks (DNNs) have achieved widespread success yet remain prone to adversarial attacks. Typically, such attacks either involve frequent queries to the target model or rely on surrogate models closely mirroring the target model -- often trained with subsets of the target model's training data -- to achieve high attack success rates through transferability. However, in realistic scenarios where training data is inaccessible and excessive queries can raise alarms, crafting adversarial examples becomes more challenging. In this paper, we present UnivIntruder, a novel attack framework that relies solely on a single, publicly available CLIP model and publicly available datasets. By using textual concepts, UnivIntruder generates universal, transferable, and targeted adversarial perturbations that mislead DNNs into misclassifying inputs into adversary-specified classes defined by textual concepts. Our extensive experiments show that our approach achieves an Attack Success Rate (ASR) of up to 85% on ImageNet and over 99% on CIFAR-10, significantly outperforming existing transfer-based methods. Additionally, we reveal real-world vulnerabilities, showing that even without querying target models, UnivIntruder compromises image search engines like Google and Baidu with ASR rates up to 84%, and vision language models like GPT-4 and Claude-3.5 with ASR rates up to 80%. These findings underscore the practicality of our attack in scenarios where traditional avenues are blocked, highlighting the need to reevaluate security paradigms in AI applications.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.1984

Country:

Asia > China (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Filters

Collaborating Authors

textual concept

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

d937cb3fe2851ed0ab9af5e38f885077-Paper-Conference.pdf

Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

9fe77ac7060e716f2d42631d156825c0-AuthorFeedback.pdf

45f7927942098d14e473fc5d000031e2-Paper-Conference.pdf

Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge

d937cb3fe2851ed0ab9af5e38f885077-Supplemental-Conference.pdf

Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Table 1 Evaluation of the state of the art model

Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP